# README file for the Replication package for 
**"From Online Job Postings to Economic Insights: A Machine Learning Approach to Structuring Naturally Occurring Data"** by Tatjana Dahlhaus, Reinhard Ellwanger, Gabriela Galassi, and Pierre-Yves Yanni.

## Overview

This replication package provides the code used to generate the figures and results in the paper, which links Canadian online job postings from Indeed to firm-level data from Advan Research using natural language processing (NLP) techniques.

The code is organized in two parts:
1. **Data construction Scripts** (require access to confidential data and cannot be executed without the necessary data agreements, though they are included for transparency and documentation) 
- **Company name matching** using tf-idf and cosine similarity to match inconsistently declared company names in the online job postings to names in the Advan Research Points-of-Interest (POI) dataset.
- **Occupational classification** of job titles into the Canadian National Occupation Classification (NOC) using a pre-trained classifier.
- **Aggregation** of data to construct the figures in the paper.

2. **Public Replication Scripts** (fully runnable with included grouped data) 
- **Nowcasting of official vacancies** using pseudo real-time information from online job postings and the Job Vacancy and Wage Survey (JVWS).
- **Analysis of digital vs. non-digital jobs dynamics** in tech vs. non-tech firms during and after the COVID-19 pandemic.

Due to licensing restrictions, raw data from Indeed and Advan are not included in this archive. However, we provide code to replicate the data processing pipeline (when access is granted) and make available aggregated outputs sufficient to reproduce all figures and tables in the paper.

## Data Availability and Provenance Statements

### Statement about Rights

I certify that the authors of the manuscript **have legitimate access to and permission to use the data** used in this manuscript. 

I certify that the authors of the manuscript **do not have permission to redistribute the original raw data** (Indeed job postings and Advan Research Points-of-Interest data).

Only pre-processed, aggregated, and transformed data derived from these sources are included in the replication package, with permission for replication purposes.

### License for Data

The original job postings data are licensed under an agreement between Indeed and the Bank of Canada, in the framework of the Indeed Policy Partners Program. These data are proprietary and cannot be redistributed.

The points-of-interest data are licensed under an agreement between Advan Research Corporation LLC (previously, Safegraph Inc.) and the Bank of Canada. These data are also proprietary and cannot be redistributed.

The Job Vacancy and Wage Survey (JVWS) data are publicly available from Statistics Canada (https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=1410037201) and can be freely accessed following their data use terms.

### Summary of Availability

Some data **cannot be made** publicly available.

Confidential data used in this paper and not provided as part of the public replication package will be preserved for 5 years after publication, in accordance with journal policies, subject to the terms of the license agreements with Advan Research and Indeed.

### Details on each Data Source

| Data.Name                      | Data.Files                                        | Location | Provided | Citation                          |
|-------------------------------|---------------------------------------------------|----------|----------|-----------------------------------|
| Indeed Job Postings           | Not applicable                                    | Not applicable    | FALSE     | Indeed Hiring Lab (2024)          |
| Advan POI Firm Data           | Not applicable                                    | Not applicable    | FALSE     | Advan Research Corporation (2024) |
| Job Vacancy and Wage Survey   | JVWS_data_by_NAICS.ipynb                                | figures/    | TRUE     | Statistics Canada (2024)          |
| NOC Digital Classification    | index_digital.csv                              | data/    | TRUE     | Galassi et al. (2024)             |
| Tech Company List             | tech_companies_ca.csv                             | data/    | TRUE     | Authors’ own compilation          |
| Indeed data by Digital and Tech  | weekly_tech_indexedto2019.csv                                | data/    | TRUE     | Indeed Hiring Lab (2024), Advan Research Corporation (2024), and Galassi et al. (2024)  |
| Nowcasting results  | nowcast_mat.dta                                | data/    | TRUE     | Indeed Hiring Lab (2024), Advan Research Corporation (2024), and Statistics Canada (2024) |
| Accuracy results | accuracy_figure.csv | data/ | TRUE | Indeed Hiring Lab (2024), Advan Research Corporation (2024) |

The **Indeed** raw job-level data and the **Advan** raw points-of-interest data are confidential and are not included due to licensing restrictions. Access to the Indeed data is restricted: a number of policy-making institutions and NGOs around the world have access via the Indeed Hiring Lab's Policy Partner Program (for more information, see https://www.hiringlab.org/policy-partners-program/). Researchers interested in accessing the data may contact the Indeed Hiring Lab at hiringlabinfo@indeed.com. Advan data can be accessed by subscription; interested researchers may contact Advan Research at https://advanresearch.com/contact. It can take some months to negotiate data use agreements and gain access to the data. The authors will assist with any reasonable replication attempts for five years following publication.

Aggregated series of indexed, classified job postings are provided. `data/weekly_tech_indexedto2019.csv` contains weekly series of job postings by digital and tech-firm classification, normalized so that the 2019 average equals 100. `data/accuracy_figure.csv` provides the percentage matched and the percentage correctly classified for different cutoffs.
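
As an illustration of the normalization, the sketch below rescales a series so that its 2019 average equals 100. It is a minimal example with made-up counts; the function name `index_to_base_year` and the dictionary layout are assumptions for this illustration, not the code used in `aggregation/`.

```python
from datetime import date
from statistics import mean


def index_to_base_year(series, base_year=2019):
    """Rescale a {date: value} series so the base-year average equals 100."""
    base_values = [v for d, v in series.items() if d.year == base_year]
    if not base_values:
        raise ValueError(f"no observations in base year {base_year}")
    base_mean = mean(base_values)
    return {d: 100.0 * v / base_mean for d, v in series.items()}


# Made-up weekly posting counts, for illustration only.
weekly_postings = {
    date(2019, 1, 7): 80,
    date(2019, 7, 1): 120,
    date(2020, 4, 6): 50,
}
indexed = index_to_base_year(weekly_postings)
for day, value in indexed.items():
    print(day.isoformat(), round(value, 1))
```

Because every series is divided by its own 2019 mean, levels are not comparable across series after indexing; only changes relative to 2019 are.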

The digital vs. non-digital classification relies on the indices provided in `data/index_digital.csv`; for more details, see Galassi et al. (2024). The tech vs. non-tech classification of companies uses `data/tech_companies_ca.csv` as a seed.

The nowcasting exercise relies on the Indeed data classified by industry and averaged monthly (script in `aggregation/indeed_naics_monavg.ipynb`) and on monthly vacancy data from the JVWS (script in `figures/JVWS_data_by_NAICS.ipynb`; the data are in the public domain and can also be downloaded directly from https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=1410037201). The inputs for Figure 2 are in `data/nowcast_mat.dta`.
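
The monthly averaging step can be sketched as follows. The sketch assumes that each weekly observation is assigned to the calendar month of its date, which may differ from how the notebook treats weeks that straddle two months; `monthly_average` and the sample values are hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def monthly_average(weekly):
    """Average weekly observations within each (year, month) pair."""
    buckets = defaultdict(list)
    for day, value in weekly.items():
        buckets[(day.year, day.month)].append(value)
    return {month: mean(values) for month, values in sorted(buckets.items())}


# Made-up weekly posting counts for one industry, for illustration only.
weekly = {
    date(2021, 1, 4): 100,
    date(2021, 1, 11): 110,
    date(2021, 2, 1): 90,
}
monthly = monthly_average(weekly)
print(monthly)
```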

## Dataset list

| Data File                         | Source                          | Notes                                                                                      | Provided |
|----------------------------------|----------------------------------|---------------------------------------------------------------------------------------------|----------|
| `rawdata/patterns_weekly_exported_YYYYMMDD_00000000000X.csv.gz` | Advan Research Corporation       | Weekly files. Not included; access through license with Advan.                              | No       |
| `rawdata/CA_job_postings_YYYYMMDD.csv` | Indeed Hiring Lab (confidential) | Weekly files; earlier files had different names. Not included; access through Indeed’s Policy Partner Program. | No       |
| `rawdata/CA_duration_YYYYMMDD.csv` | Indeed Hiring Lab (confidential) | Weekly files; earlier files had different names. Not included; access through Indeed’s Policy Partner Program. | No       |
| `accuracy/df_matches_0.csv` | Indeed Hiring Lab (confidential) | Not included, due to license agreements. | No       |
| `figures/JVWS_data_by_NAICS.ipynb`      | Statistics Canada (JVWS)         | Jupyter notebook to download public vacancy data from Statistics Canada.                  | Yes      |
| `data/accuracy_figure.csv` | Aggregated Indeed classified job postings | Accuracy and percentage matched for different matching cutoffs. | Yes      |
| `data/weekly_tech_indexedto2019.csv`   | Aggregated Indeed classified job postings       | Indexed weekly job postings by digital/tech combinations. Based on proprietary Indeed data. | Yes      |
| `data/nowcast_mat.dta`           | Aggregated Indeed & JVWS         | Nowcasting model matrix combining online postings and monthly JVWS data. Input for Figure 2. | Yes      |
| `data/index_digital.csv`      | Galassi et al. (2024)            | Classification of NOC 2016 occupations into digital categories.                            | Yes      |
| `data/tech_companies_ca.csv`     | Authors’ own compilation         | Manually curated list of 150 Canadian tech companies.                                       | Yes      |

A full description of the variables in the `data/` folder is provided in `codebook.csv`, included in this archive.

## Computational requirements

### Software Requirements

The replication package contains one or more programs to install all dependencies.

Portions of the code (.sh files) use bash scripting, which may require Linux.

There are two different parts of the code, with different software requirements. Code in `company_matching/`, `aggregation/` and `figures/` requires the following:

- Mamba package manager (preferred, adapt bash lines for Conda)
- Python 3.10+
- Stata (version 18 preferred)

All necessary Python packages are specified in `env_indeed2.yml`.

Code in `occupation_classification/` uses the [occupationcoder](https://github.com/aeturrell/occupationcoder) library. It requires:

- Mamba package manager (preferred, adapt bash lines for Conda)
- Python 3.10 (data preparation step) 
- Python 2.7 (classification step)

All necessary Python packages for the data preparation step are specified in `indeed_noc_python3.yml` (Python 3.10); those for the classification step come with the `occupationcoder` library (Python 2.7), whose install location is determined during setup (see the instructions below).

## Description of programs/code

- Programs in `company_matching` append all Indeed and Advan Research data, perform the fuzzy matching, and select the matches that pass the cutoff; they also analyze the accuracy of the match after manual validation. The file `company_matching/prepare_data_and_match.sh` runs them all.
- Programs in `occupation_classification` append all Indeed data, classify job postings into occupations (processing English and French postings separately), and append the results together. The file `occupation_classification/occupation_classification_script.sh` runs them all.
- Programs in `aggregation` generate the aggregate datasets included in `data/`, including monthly vacancies and job postings, and weekly job postings by digital and tech categories, indexed so that the 2019 average equals 100.
- Programs in `figures` can be run with the included data; they use the aggregate data in `data/` to create Figures 1, 2 and 3 in the paper. The notebook `JVWS_data_by_NAICS.ipynb` also extracts the public JVWS data needed to run the nowcast.
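
To give a sense of the company name matching step, here is a minimal pure-Python sketch of tf-idf weighting with cosine similarity and a cutoff. It is not the code in `company_matching/`: the character trigram tokenizer, the smoothed idf formula, the cutoff of 0.5, the function names and the company names are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def char_ngrams(name, n=3):
    """Lower-case a name and split it into overlapping character n-grams."""
    padded = " " + " ".join(name.lower().split()) + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def tfidf_vectors(names, n=3):
    """Return one L2-normalized tf-idf vector (as a dict) per name."""
    docs = [Counter(char_ngrams(name, n)) for name in names]
    # Iterating a Counter yields its keys, so each n-gram counts once per name.
    doc_freq = Counter(gram for doc in docs for gram in doc)
    total = len(docs)
    vectors = []
    for doc in docs:
        # Smoothed idf: rarer n-grams receive larger weights.
        weights = {g: tf * (math.log((1 + total) / (1 + doc_freq[g])) + 1)
                   for g, tf in doc.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({g: w / norm for g, w in weights.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two already-normalized sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(g, 0.0) for g, w in u.items())


def best_matches(posting_names, poi_names, cutoff=0.5):
    """Map each posting name to (best POI name, score), or None below the cutoff."""
    vectors = tfidf_vectors(posting_names + poi_names)
    post_vecs = vectors[:len(posting_names)]
    poi_vecs = vectors[len(posting_names):]
    matches = {}
    for name, vec in zip(posting_names, post_vecs):
        score, target = max((cosine(vec, pv), poi)
                            for poi, pv in zip(poi_names, poi_vecs))
        matches[name] = (target, score) if score >= cutoff else None
    return matches


# Fictional company names, for illustration only.
postings = ["Maple Leaf Bakery Inc.", "NORTHERN LIGHTS hardware", "Zzyx Quantum"]
pois = ["Maple Leaf Bakery", "Northern Lights Hardware", "Prairie Coffee Roasters"]
matches = best_matches(postings, pois)
for name, result in matches.items():
    print(name, "->", result)
```

Raising the cutoff trades coverage for precision: fewer posting names are matched, but the retained matches are more likely to be correct.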

## Instructions to Replicators

This replication package has two blocks: the data construction and aggregation scripts, which use confidential data and therefore cannot be run with the inputs included in this package; and the code to construct Figures 1, 2 and 3 of the paper, which runs on the included aggregated data.

1. **Data construction Scripts**

**Company name matching**

The code for this pipeline is in `company_matching/`. To run it, once you have obtained access to the confidential Indeed and Advan data:

1. Create the necessary environment for the matching algorithms:
  a. Install Mamba (https://mamba.readthedocs.io/en/latest/installation/mamba-installation.html) if not already installed.
  b. In the terminal, navigate to the `company_matching/` folder.
  c. Create the environment by running:
```bash
mamba env create -f env_indeed2.yml
```
2. In the location chosen to store the data, create folders that respect the following structure:
- raw data
- analysis data (intermediate)
- curated data
- accuracy

3. Edit file paths before execution: Before running the matching pipeline, you must review and edit the data file paths in the Jupyter notebooks. Look for comments marked with `CHANGE` in the code. 

4. Run the full matching pipeline using the provided shell script:
```bash
bash prepare_data_and_match.sh
```
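
The folder layout from step 2 can also be created programmatically. The sketch below is only an illustration: the folder names are hypothetical stand-ins for the four folders listed in step 2, and the names actually expected by the notebooks are the ones next to their `CHANGE` comments. The demo runs in a temporary directory so that it leaves nothing behind.

```python
import tempfile
from pathlib import Path

# Hypothetical folder names standing in for: raw data, analysis data
# (intermediate), curated data, accuracy.
SUBFOLDERS = ["rawdata", "analysis_data", "curated_data", "accuracy"]


def make_data_folders(root, subfolders=SUBFOLDERS):
    """Create each subfolder under root, leaving existing folders untouched."""
    created = []
    for name in subfolders:
        folder = Path(root) / name
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


# Demo in a throwaway directory; replace tmp with your own data location.
with tempfile.TemporaryDirectory() as tmp:
    folders = make_data_folders(tmp)
    all_exist = all(folder.is_dir() for folder in folders)
    print([folder.name for folder in folders], all_exist)
```
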
**Occupation classification**

The code for this pipeline is in `occupation_classification/`. To run it, once you have obtained access to the confidential Indeed data:

1. Create the necessary environments for the classification algorithms:
  a. Install Mamba (https://mamba.readthedocs.io/en/latest/installation/mamba-installation.html) if not already installed.
  b. In the terminal, navigate to the `occupation_classification/` folder.
  c. Create the data-processing environment by running:
```bash
mamba env create -f indeed_noc_python3.yml
```
  d. For the occupation coder environment, run: 
```bash
mamba create -n occupationcoder python=2.7
mamba activate occupationcoder
```
Then clone the occupation coder from GitHub and install it:
```bash
git clone https://github.com/aeturrell/occupationcoder.git
cd occupationcoder
pip install .
```
The installed occupation coder library will be visible at `$CONDA_PREFIX/lib/python2.7/site-packages/occupationcoder/`. You will copy classification dictionaries into this folder during runtime.

2. In the location chosen to store the data, create folders that respect the following structure:
- raw data
- analysis data (intermediate)
- curated data

3. Edit file paths before execution: Before running the classification pipeline, you must review and edit the data file paths in the Jupyter notebooks. Look for comments marked with `CHANGE` in the code. 

4. Run the full classification pipeline using the provided shell script:
```bash
bash occupation_classification_script.sh
```
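
To locate the lines that need editing, one option is to scan the notebooks for the `CHANGE` marker. The following is a small sketch that relies only on the fact that `.ipynb` files are JSON documents with a `cells` list; the helper name `find_change_markers` and the demo notebook are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def find_change_markers(notebook_path, marker="CHANGE"):
    """Return (cell_index, line) for every code-cell line containing the marker."""
    notebook = json.loads(Path(notebook_path).read_text(encoding="utf-8"))
    hits = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", [])
        # The source is stored either as a list of lines or as a single string.
        lines = source.splitlines() if isinstance(source, str) else source
        for line in lines:
            if marker in line:
                hits.append((index, line.strip()))
    return hits


# Demo on a tiny made-up notebook written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.ipynb"
    demo.write_text(json.dumps({"cells": [
        {"cell_type": "code",
         "source": ["# CHANGE: path to raw data\n", "raw_dir = '/path/to/rawdata'\n"]},
        {"cell_type": "markdown", "source": ["CHANGE is ignored in markdown cells"]},
    ]}), encoding="utf-8")
    demo_hits = find_change_markers(demo)
    print(demo_hits)
```

To scan a whole pipeline folder, loop over `Path(folder).glob("*.ipynb")` and call `find_change_markers` on each file.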

The code `analyze_matching_results.ipynb` constructs Figure 1 of the paper.
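
The quantities behind the accuracy figure (percentage matched and percentage correctly classified at different cutoffs) can be illustrated with the sketch below. It assumes a hand-validated sample in which each candidate match has a similarity score and a correct/incorrect flag, and it defines "matched" as having a score at or above the cutoff; these definitions, the function name and the sample values are assumptions, not taken from the notebook.

```python
def accuracy_by_cutoff(validated, cutoffs):
    """Compute share matched and share correct among matched, per cutoff.

    validated: list of (similarity_score, is_correct) pairs.
    """
    rows = []
    for cutoff in cutoffs:
        kept = [ok for score, ok in validated if score >= cutoff]
        pct_matched = 100.0 * len(kept) / len(validated)
        # Accuracy is undefined when no candidate passes the cutoff.
        pct_correct = 100.0 * sum(kept) / len(kept) if kept else float("nan")
        rows.append({"cutoff": cutoff,
                     "pct_matched": pct_matched,
                     "pct_correct": pct_correct})
    return rows


# Made-up validation sample, for illustration only.
validated = [(0.95, True), (0.80, True), (0.65, False), (0.40, False)]
table = accuracy_by_cutoff(validated, cutoffs=[0.5, 0.9])
for row in table:
    print(row)
```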

**Data aggregation**

The code for this pipeline is in `aggregation/`. To run it, once you have obtained access to the confidential Indeed data:

1. Activate the `env_indeed2` environment created earlier (if you have not created it, see step 1 of the company name matching pipeline):
  a. In the terminal, navigate to the `aggregation/` folder.
  b. Activate the environment by typing:
```bash
mamba activate env_indeed2
```
2. In the location chosen to store the data, create folders that respect the following structure:
- curated data

3. Edit file paths before execution: Before running the aggregation pipeline, you must review and edit the data file paths in the Jupyter notebooks. Look for comments marked with `CHANGE` in the code. 

4. Open a Jupyter notebook editor (e.g., VS Code), and run `digital_jobs_tech_firms.ipynb`, `indeed_naics_movavg.ipynb` and `figures/JVWS_data_by_NAICS.ipynb` (the latter is stored in a different location because it can be run without confidential data).

5. Open Stata and run `nowcast_estimation.do`.

2. **Figures**

This pipeline reproduces Figures 1, 2 and 3 in the paper. The notebooks in `figures/` can be run in any order.

## List of figures and programs

The provided code reproduces:

- All numbers provided in text in the paper
- All tables and figures in the paper

| Figure/Table #    | Program                  | Output file                      | Note                            |
|-------------------|--------------------------|----------------------------------|---------------------------------|
| Figure 1           | figures/Figure1.ipynb    | accuracy_and_perc_matched.eps                 ||
| Figure 2           | figures/Figure2.ipynb    | nowcast_mspe.eps                       ||
| Figure 3           | figures/Figure3.ipynb    | digital_by_tech_ind.eps, non-digital_by_tech_ind.eps                       ||

## References

Advan Research Corporation. 2024. Weekly Patterns+, proprietary Points-of-Interest data (2019-2025) acquired by the Bank of Canada. Not publicly available.

Galassi, G., A. Bellatin, and V. Chu. 2024. Letting Job Postings Talk: Recent Trends in Digitalization. In Big Data Applications in Labor Economics, Part B (Vol. 52B, pp. 1–33). Research in Labor Economics. Emerald Group Publishing Limited. DOI: 10.1108/S0147-91212024000052B022.

Indeed Hiring Lab. 2024. Proprietary Job Postings Data (2018-2025) provided to the Bank of Canada through the Indeed Hiring Lab Policy Partner Program. Not publicly available.

Statistics Canada. 2024. Table 14-10-0372-01 Job vacancies, payroll employees, and job vacancy rate by industry sector, monthly, unadjusted for seasonality.
